fix(docling)!: change default export type to MARKDOWN and add page_number to chunk metadata#3276

Merged
bogdankostic merged 2 commits into deepset-ai:main from SyedShahmeerAli12:fix/docling-metadata-consistency on May 11, 2026

Conversation

@SyedShahmeerAli12 SyedShahmeerAli12 commented May 6, 2026

Related Issues

Proposed Changes:

  • ExportType.MARKDOWN is now the default export type (previously DOC_CHUNKS), aligning DoclingConverter with Haystack's convention of separating conversion from chunking. Users who want chunked output should pass export_type=ExportType.DOC_CHUNKS explicitly.
  • MetaExtractor.extract_chunk_meta() now extracts page_number from chunk provenance info, making chunk metadata consistent with other Haystack splitters like DocumentSplitter.
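A minimal sketch of the page_number extraction described above. The dataclasses below are hypothetical stand-ins for the chunk provenance structures, not the real Docling classes, and picking the lowest page when a chunk spans several pages is an assumption inferred from the test name test_extract_chunk_meta_page_number_uses_minimum.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Hypothetical stand-ins for the chunk provenance structures.
@dataclass
class Prov:
    page_no: int


@dataclass
class DocItem:
    prov: list[Prov] = field(default_factory=list)


@dataclass
class ChunkMeta:
    doc_items: list[DocItem] = field(default_factory=list)


def extract_page_number(meta: ChunkMeta) -> Optional[int]:
    """Return the lowest page number found in the chunk's provenance, or None."""
    pages = [p.page_no for item in meta.doc_items for p in item.prov]
    return min(pages) if pages else None


def extract_chunk_meta(meta: ChunkMeta) -> dict[str, Any]:
    """Keep the original metadata under dl_meta and add page_number when known."""
    result: dict[str, Any] = {"dl_meta": meta}
    page_number = extract_page_number(meta)
    if page_number is not None:
        result["page_number"] = page_number
    return result
```

In this sketch a chunk without provenance simply gets no page_number key; whether the merged code omits the key or sets a fallback value is not stated in this PR description.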

How did you test it?

  • All 34 existing unit tests pass
  • Added 2 new unit tests: test_extract_chunk_meta_includes_page_number and test_extract_chunk_meta_page_number_uses_minimum

Notes for the reviewer

  • This is a breaking change: the default export_type has changed from DOC_CHUNKS to MARKDOWN. Existing pipelines that relied on the default without setting it explicitly will need to add export_type=ExportType.DOC_CHUNKS.
  • dl_meta is preserved in chunk metadata for backward compatibility alongside the new page_number field.
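To illustrate why callers relying on the old default are affected, here is a small stand-in sketch. ExportType and describe_output below are hypothetical and only mirror the default-argument change; they are not the integration's actual classes, and the output descriptions are assumptions.

```python
from enum import Enum


class ExportType(str, Enum):
    """Stand-in for the integration's ExportType; the values are illustrative."""

    MARKDOWN = "markdown"
    DOC_CHUNKS = "doc_chunks"


def describe_output(export_type: ExportType = ExportType.MARKDOWN) -> str:
    """Hypothetical helper whose default mirrors the converter's new default."""
    if export_type is ExportType.DOC_CHUNKS:
        return "one Document per chunk"
    return "one Markdown Document per source"


# Relying on the default now yields Markdown output.
print(describe_output())
# Passing the export type explicitly keeps the previous chunked behaviour.
print(describe_output(ExportType.DOC_CHUNKS))
```

Pinning export_type explicitly in existing pipelines keeps their behaviour independent of the default.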

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I have used one of the conventional commit types for my PR title: fix:

…ber to chunk metadata

- ExportType.MARKDOWN is now the default (was DOC_CHUNKS), aligning
  with Haystack convention of separating conversion from chunking
- MetaExtractor.extract_chunk_meta now extracts page_number from
  chunk provenance, making metadata consistent with other Haystack splitters
@SyedShahmeerAli12 SyedShahmeerAli12 requested a review from a team as a code owner May 6, 2026 07:28
@SyedShahmeerAli12 SyedShahmeerAli12 requested review from bogdankostic and removed request for a team May 6, 2026 07:28
@github-actions github-actions Bot added the integration:docling and type:documentation labels May 6, 2026
SyedShahmeerAli12 commented May 6, 2026

Hey @bogdankostic, this resolves #3256. Happy to get feedback on this!

@bogdankostic bogdankostic left a comment

Thank you @SyedShahmeerAli12! I added a comment about reverting the additions to the changelog as these are added automatically.

Also, I was wondering if we could add more metadata, as pointed out in the issue, such as split_id and split_idx_start.

Please revert these changes. The changelog will be populated automatically when a new release is triggered.

@bogdankostic bogdankostic changed the title fix(docling): change default export type to MARKDOWN and add page_number to chunk metadata fix(docling)!: change default export type to MARKDOWN and add page_number to chunk metadata May 6, 2026
SyedShahmeerAli12 commented May 6, 2026

@bogdankostic Both points addressed:

  • CHANGELOG: reverted; the manually added [Unreleased] section is removed.
  • split_id / split_idx_start: added both fields to chunk metadata in the DOC_CHUNKS branch of run(), alongside the existing page_number. split_id is the 0-based chunk index, and split_idx_start is the cumulative character offset based on chunk.text length. Both reset per source document, matching the behaviour of Haystack's DocumentSplitter. Tests are updated and all 36 pass.
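The bookkeeping described in the second point can be sketched roughly as follows. add_split_meta is a hypothetical helper name used only for illustration; the actual logic lives in the DOC_CHUNKS branch of run() and may differ in detail.

```python
from typing import Any


def add_split_meta(chunk_texts: list[str]) -> list[dict[str, Any]]:
    """Compute split_id and split_idx_start for the chunks of one source document."""
    metas: list[dict[str, Any]] = []
    offset = 0  # cumulative character offset, starts at 0 for every source document
    for split_id, text in enumerate(chunk_texts):  # 0-based chunk index
        metas.append({"split_id": split_id, "split_idx_start": offset})
        offset += len(text)  # the next chunk starts after this chunk's text
    return metas
```

Calling the helper once per source document gives the per-document reset for free, because both counters are local to the call.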

@bogdankostic bogdankostic left a comment

Thanks @SyedShahmeerAli12, looking good to me! :)

@bogdankostic bogdankostic merged commit 28ada67 into deepset-ai:main May 11, 2026
16 checks passed

Labels

integration:docling, type:documentation

Development

Successfully merging this pull request may close these issues.

Make DoclingConverter metadata consistent with other converters
